Collaborating Authors

 Cache County



The Generalized Proximity Forest

Shaw, Ben, Rustad, Adam, Maia, Sofia Pelagalli, Rhodes, Jake S., Moon, Kevin R.

arXiv.org Machine Learning

Recent work has demonstrated the utility of Random Forest (RF) proximities for various supervised machine learning tasks, including outlier detection, missing data imputation, and visualization. However, the utility of the RF proximities depends upon the success of the RF model, which itself is not the ideal model in all contexts. RF proximities have recently been extended to time series by means of the distance-based Proximity Forest (PF) model, among others, affording time series analysis with the benefits of RF proximities. In this work, we introduce the generalized PF model, thereby extending RF proximities to all contexts in which supervised distance-based machine learning can occur. Additionally, we introduce a variant of the PF model for regression tasks. We also introduce the notion of using the generalized PF model as a meta-learning framework, extending supervised imputation capability to any pre-trained classifier. We experimentally demonstrate the unique advantages of the generalized PF model compared with both the RF model and the k-nearest neighbors model.
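As a rough illustration of what a forest proximity is, the sketch below computes the classic leaf co-occurrence proximity: the fraction of trees in which two samples land in the same leaf. This is an assumption chosen for illustration, not necessarily the proximity definition used in the paper; the function name `leaf_proximities` and the leaf assignments are hypothetical.

```python
from typing import List


def leaf_proximities(leaves: List[List[int]]) -> List[List[float]]:
    """Classic leaf co-occurrence proximity.

    leaves[i][t] is the index of the leaf that sample i reaches in tree t.
    The proximity of samples i and j is the fraction of trees in which
    both samples end up in the same leaf.
    """
    n = len(leaves)
    n_trees = len(leaves[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            prox[i][j] = shared / n_trees
    return prox


# Hypothetical leaf assignments for 3 samples in a 4-tree forest.
leaves = [
    [0, 2, 1, 3],
    [0, 2, 0, 3],
    [1, 0, 0, 2],
]
prox = leaf_proximities(leaves)
```

In this toy input, samples 0 and 1 share a leaf in three of four trees, so their proximity is 0.75, while samples 0 and 2 never share a leaf.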


TAWRMAC: A Novel Dynamic Graph Representation Learning Method

Farokhi, Soheila, Qi, Xiaojun, Karimi, Hamid

arXiv.org Artificial Intelligence

Dynamic graph representation learning has become essential for analyzing evolving networks in domains such as social network analysis, recommendation systems, and traffic analysis. However, existing continuous-time methods face three key challenges: (1) some methods depend solely on node-specific memory without effectively incorporating information from neighboring nodes, resulting in embedding staleness; (2) most fail to explicitly capture correlations between node neighborhoods, limiting contextual awareness; and (3) many fail to fully capture the structural dynamics of evolving graphs, especially in the absence of rich link attributes. To address these limitations, we introduce TAWRMAC, a novel framework that integrates Temporal Anonymous Walks with Restart, Memory Augmentation, and Neighbor Co-occurrence embedding. TAWRMAC enhances embedding stability through a memory-augmented GNN with fixed-time encoding and improves contextual representation by explicitly capturing neighbor correlations. Additionally, its Temporal Anonymous Walks with Restart mechanism distinguishes between nodes exhibiting repetitive interactions and those forming new connections beyond their immediate neighborhood. This approach captures structural dynamics better and supports strong inductive learning. Extensive experiments on multiple benchmark datasets demonstrate that TAWRMAC consistently outperforms state-of-the-art methods in dynamic link prediction and node classification under both transductive and inductive settings across three different negative sampling strategies. By providing stable, generalizable, and context-aware embeddings, TAWRMAC advances the state of the art in continuous-time dynamic graph learning. The code is available at https://anonymous.4open.science/r/tawrmac-A253.
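To make the idea of an anonymous walk concrete, the sketch below applies the standard anonymization step: each node identity in a walk is replaced by the order in which it first appeared, so that only the structural pattern of revisits remains. This is the textbook definition and an assumption here; TAWRMAC's temporal variant with restart may differ, and the function name `anonymize_walk` is hypothetical.

```python
from typing import Hashable, List, Sequence


def anonymize_walk(walk: Sequence[Hashable]) -> List[int]:
    """Replace each node by the index of its first appearance in the walk."""
    first_seen = {}
    pattern = []
    for node in walk:
        if node not in first_seen:
            # A node seen for the first time gets the next unused index.
            first_seen[node] = len(first_seen)
        pattern.append(first_seen[node])
    return pattern


# A walk that keeps returning to "u" versus one that keeps exploring.
repetitive = anonymize_walk(["u", "v", "u", "v"])
exploring = anonymize_walk(["u", "v", "w", "x"])
```

The first walk anonymizes to `[0, 1, 0, 1]` and the second to `[0, 1, 2, 3]`, which is the kind of distinction between repeated interactions and new connections that the abstract describes.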



Mathematical Theory of Collinearity Effects on Machine Learning Variable Importance Measures

Bladen, Kelvyn K., Cutler, D. Richard, Wisler, Alan

arXiv.org Machine Learning

In many machine learning problems, understanding variable importance is a central concern. Two common approaches are Permute-and-Predict (PaP), which randomly permutes a feature in a validation set, and Leave-One-Covariate-Out (LOCO), which retrains models after permuting a training feature. Both methods deem a variable important if predictions with the original data substantially outperform those with permutations. In linear regression, empirical studies have linked PaP to regression coefficients and LOCO to $t$-statistics, but a formal theory has been lacking. We derive closed-form expressions for both measures, expressed using square-root transformations. PaP is shown to be proportional to the coefficient and predictor variability: $\text{PaP}_i = \beta_i \sqrt{2\operatorname{Var}(\mathbf{x}^v_i)}$, while LOCO is proportional to the coefficient but dampened by collinearity (captured by $\Delta$): $\text{LOCO}_i = \beta_i (1 - \Delta)\sqrt{1 + c}$. These derivations explain why PaP is largely unaffected by multicollinearity, whereas LOCO is highly sensitive to it. Monte Carlo simulations confirm these findings across varying levels of collinearity. Although derived for linear regression, we also show that these results provide reasonable approximations for models like Random Forests. Overall, this work establishes a theoretical basis for two widely used importance measures, helping analysts understand how they are affected by the true coefficients, dimension, and covariance structure. This work bridges empirical evidence and theory, enhancing the interpretability and application of variable importance measures.
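The PaP expression can be checked numerically in a simplified setting. The sketch below assumes a noise-free linear model with known coefficients and takes PaP to be the square root of the increase in validation mean squared error after permuting one feature; that reading of the "square-root transformation" is an assumption, as are the coefficient values and sample size. Because a square root returns a magnitude, the comparison uses the absolute value of the coefficient.

```python
import math
import random
import statistics

random.seed(0)
n = 20000
beta = [2.0, -1.0]  # hypothetical true coefficients

# Independent validation features with different variances.
x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
x2 = [random.gauss(0.0, 3.0) for _ in range(n)]


def predict(a, b):
    """Linear prediction with the known coefficients."""
    return [beta[0] * u + beta[1] * v for u, v in zip(a, b)]


def mse(p, q):
    """Mean squared difference between two equally long sequences."""
    return sum((u - v) ** 2 for u, v in zip(p, q)) / len(p)


y = predict(x1, x2)  # noise-free response, so the baseline error is zero

# Permute the first feature only.
x1_perm = x1[:]
random.shuffle(x1_perm)

# Assumed PaP: square root of the MSE increase caused by the permutation.
pap_empirical = math.sqrt(mse(y, predict(x1_perm, x2)) - mse(y, predict(x1, x2)))

# Closed form from the abstract, using |beta_1| for the magnitude.
pap_formula = abs(beta[0]) * math.sqrt(2 * statistics.pvariance(x1))
```

The two quantities agree up to sampling error because the permuted and original columns are nearly uncorrelated, so the mean squared difference between them is close to twice the feature variance.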


Label-Guided Imputation via Forest-Based Proximities for Improved Time Series Classification

Rhodes, Jake S., Rustad, Adam G., Maia, Sofia Pelagalli, Thacker, Evan, Choi, Hyunmi, Gutierrez, Jose, Rundek, Tatjana, Shaw, Ben

arXiv.org Machine Learning

Missing data is a common problem in time series data. Most methods for imputation ignore label information pertaining to the time series even if that information exists. In this paper, we provide a framework for missing data imputation in the context of time series classification, where each time series is associated with a categorical label. We define a means of imputing missing values conditional upon labels, the method being guided by powerful, existing supervised models designed for high accuracy in this task. From each model, we extract a tree-based proximity measure from which imputation can be applied. We show that imputation using this method generally provides richer information leading to higher classification accuracies, despite the imputed values differing from the true values.
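One simple way a proximity measure can drive imputation is a proximity-weighted average of the observed values at the same time step in other series. The sketch below shows only that idea; it is an assumption for illustration rather than the paper's procedure, and the function name `proximity_impute` and the numbers are hypothetical.

```python
from typing import List, Optional


def proximity_impute(values: List[Optional[float]],
                     prox_row: List[float]) -> Optional[float]:
    """Proximity-weighted mean of the observed values.

    values[j] is the value of series j at the time step being imputed
    (None if missing); prox_row[j] is the proximity between the target
    series and series j.
    """
    pairs = [(v, w) for v, w in zip(values, prox_row) if v is not None and w > 0]
    total = sum(w for _, w in pairs)
    if total == 0:
        return None  # no observed neighbor with positive proximity
    return sum(v * w for v, w in pairs) / total


# Series 1 is missing a value; series 3 has zero proximity to it.
values = [1.0, None, 3.0, 5.0]
prox_row = [0.5, 1.0, 0.5, 0.0]
imputed = proximity_impute(values, prox_row)
```

Here the imputed value is 2.0: the two series with positive proximity contribute equally, and the dissimilar series is ignored. If the proximities come from a supervised model, the label information enters through those weights.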


TIMED: Adversarial and Autoregressive Refinement of Diffusion-Based Time Series Generation

EskandariNasab, MohammadReza, Hamdi, Shah Muhammad, Boubrahimi, Soukaina Filali

arXiv.org Artificial Intelligence

Generating high-quality synthetic time series is a fundamental yet challenging task across domains such as forecasting and anomaly detection, where real data can be scarce, noisy, or costly to collect. Unlike static data generation, synthesizing time series requires modeling both the marginal distribution of observations and the conditional temporal dependencies that govern sequential dynamics. We propose TIMED, a unified generative framework that integrates a denoising diffusion probabilistic model (DDPM) to capture global structure via a forward-reverse diffusion process, a supervisor network trained with teacher forcing to learn autoregressive dependencies through next-step prediction, and a Wasserstein critic that provides adversarial feedback to ensure temporal smoothness and fidelity. To further align the real and synthetic distributions in feature space, TIMED incorporates a Maximum Mean Discrepancy (MMD) loss, promoting both diversity and sample quality. All components are built using masked attention architectures optimized for sequence modeling and are trained jointly to effectively capture both unconditional and conditional aspects of time series data. Experimental results across diverse multivariate time series benchmarks demonstrate that TIMED generates more realistic and temporally coherent sequences than state-of-the-art generative models.
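For readers unfamiliar with the MMD term, the sketch below computes a biased estimate of squared Maximum Mean Discrepancy between two small samples using an RBF kernel. The kernel choice, the bandwidth parameter `gamma`, and the toy feature vectors are assumptions for illustration; the abstract does not specify how TIMED computes its MMD loss.

```python
import math
from typing import List, Sequence


def rbf(a: Sequence[float], b: Sequence[float], gamma: float = 1.0) -> float:
    """RBF kernel exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((u - v) ** 2 for u, v in zip(a, b)))


def mmd2(xs: List[Sequence[float]], ys: List[Sequence[float]],
         gamma: float = 1.0) -> float:
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(p, q):
        return sum(rbf(a, b, gamma) for a in p for b in q) / (len(p) * len(q))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Hypothetical feature vectors for real and synthetic windows.
real = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3]]
close = [[0.1, 0.1], [0.2, 0.1], [0.0, 0.2]]
far = [[2.0, 2.1], [2.2, 2.0], [2.1, 2.3]]

mmd_close = mmd2(real, close)
mmd_far = mmd2(real, far)
```

A synthetic sample that sits near the real one yields a smaller value than one that is shifted away, which is why minimizing this quantity pulls the two distributions together in feature space.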


BiasMap: Leveraging Cross-Attentions to Discover and Mitigate Hidden Social Biases in Text-to-Image Generation

Chakraborty, Rajatsubhra, Che, Xujun, Xu, Depeng, Faklaris, Cori, Niu, Xi, Yuan, Shuhan

arXiv.org Artificial Intelligence

Bias discovery is critical for black-box generative models, especially text-to-image (TTI) models. Existing works predominantly focus on output-level demographic distributions, which do not necessarily guarantee concept representations to be disentangled post-mitigation. We propose BiasMap, a model-agnostic framework for uncovering latent concept-level representational biases in stable diffusion models. BiasMap leverages cross-attention attribution maps to reveal structural entanglements between demographics (e.g., gender, race) and semantics (e.g., professions), going deeper into representational bias during image generation. Using attribution maps of these concepts, we quantify the spatial demographics-semantics concept entanglement via Intersection over Union (IoU), offering a lens into bias that remains hidden in existing fairness discovery approaches. In addition, we utilize BiasMap for bias mitigation through energy-guided diffusion sampling that directly modifies the latent noise space and minimizes the expected SoftIoU during the denoising process. Our findings show that existing fairness interventions may reduce the output distributional gap but often fail to disentangle concept-level coupling, whereas our mitigation method can mitigate concept entanglement in image generation while complementing distributional bias mitigation.
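The overlap measure can be sketched on two small attribution maps. The hard IoU below thresholds each map at an assumed value of 0.5, and the soft variant uses the common min/max ratio; both the threshold and this particular SoftIoU form are assumptions, since the abstract does not give BiasMap's exact definitions. The maps themselves are hypothetical.

```python
from typing import List

Map2D = List[List[float]]


def iou(map_a: Map2D, map_b: Map2D, threshold: float = 0.5) -> float:
    """Intersection over Union of two maps after binarizing at `threshold`."""
    a = [v >= threshold for row in map_a for v in row]
    b = [v >= threshold for row in map_b for v in row]
    inter = sum(1 for p, q in zip(a, b) if p and q)
    union = sum(1 for p, q in zip(a, b) if p or q)
    return inter / union if union else 0.0


def soft_iou(map_a: Map2D, map_b: Map2D) -> float:
    """Differentiable-style overlap: sum of minima over sum of maxima."""
    a = [v for row in map_a for v in row]
    b = [v for row in map_b for v in row]
    denom = sum(max(p, q) for p, q in zip(a, b))
    return sum(min(p, q) for p, q in zip(a, b)) / denom if denom else 0.0


# Hypothetical 2x2 attribution maps for a demographic and a profession token.
demographic_map = [[0.9, 0.8], [0.1, 0.0]]
profession_map = [[0.7, 0.2], [0.6, 0.0]]

hard = iou(demographic_map, profession_map)
soft = soft_iou(demographic_map, profession_map)
```

A higher value means the two concepts attend to the same image regions, which is the entanglement the abstract proposes to measure and reduce.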


An Evolutionary Multi-objective Optimization for Replica-Exchange-based Physics-informed Operator Learning Network

Lu, Binghang, Mou, Changhong, Lin, Guang

arXiv.org Artificial Intelligence

In this paper, we propose an evolutionary Multi-objective Optimization for Replica-Exchange-based Physics-informed Operator learning Network, a novel operator learning network that efficiently solves parametric partial differential equations. In both forward and inverse settings, this operator learning network requires only a minimal amount of noisy observational data. While physics-informed neural networks and operator learning approaches such as Deep Operator Networks and Fourier Neural Operators offer promising alternatives to traditional numerical solvers, they struggle with balancing operator and physics losses, maintaining robustness under noisy or sparse data, and providing uncertainty quantification. The proposed framework addresses these limitations by integrating: (i) evolutionary multi-objective optimization to adaptively balance operator and physics-based losses in the Pareto front; (ii) replica exchange stochastic gradient Langevin dynamics to improve global parameter-space exploration and accelerate convergence; and (iii) built-in Bayesian uncertainty quantification from stochastic sampling. The proposed operator learning method is tested numerically on several problems, including the one-dimensional Burgers equation and the time-fractional mixed diffusion-wave equation. The results indicate that our framework consistently outperforms the general operator learning methods in accuracy, noise robustness, and the ability to quantify uncertainty.


Computation- and Communication-Efficient Online FL for Resource-Constrained Aerial Vehicles

Pervej, Ferdous, Jin, Richeng, Chowdhury, Md Moin Uddin, Singh, Simran, Güvenç, İsmail, Dai, Huaiyu

arXiv.org Artificial Intelligence

Privacy-preserving distributed machine learning (ML) and aerial connected vehicle (ACV)-assisted edge computing have drawn significant attention lately. Since the onboard sensors of ACVs can capture new data as they move along their trajectories, the continual arrival of such 'newly' sensed data leads to online learning and demands carefully crafting the trajectories. Besides, as typical ACVs are inherently resource-constrained, computation- and communication-efficient ML solutions are needed. Therefore, we propose a computation- and communication-efficient online aerial federated learning (2CEOAFL) algorithm to take advantage of the continually sensed data and the limited onboard resources of the ACVs. In particular, considering that independently owned ACVs act as selfish data collectors, we first model their trajectories according to their respective time-varying data distributions. We then propose a 2CEOAFL algorithm that allows the flying ACVs to (a) prune the received dense ML model to make it shallow, (b) train the pruned model, and (c) probabilistically quantize and offload their trained accumulated gradients to the central server (CS). Our extensive simulation results show that the proposed 2CEOAFL algorithm delivers performance comparable to that of its non-pruned and non-quantized, and hence computation- and communication-inefficient, counterparts.
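Step (c) can be illustrated with a generic stochastic quantizer. The sketch below rounds each gradient entry to one of a few evenly spaced levels, choosing the upper or lower neighbor at random so that the quantized value equals the original in expectation. The uniform grid, the number of levels, and the function name `stochastic_quantize` are assumptions; the abstract does not specify the quantizer used in 2CEOAFL.

```python
import math
import random
from typing import List


def stochastic_quantize(values: List[float], levels: int,
                        rng: random.Random) -> List[float]:
    """Unbiased stochastic rounding onto `levels` evenly spaced grid points."""
    lo, hi = min(values), max(values)
    if hi == lo or levels < 2:
        return list(values)  # nothing to quantize
    step = (hi - lo) / (levels - 1)
    out = []
    for v in values:
        pos = (v - lo) / step          # position measured in grid steps
        low = math.floor(pos)
        frac = pos - low               # distance to the lower grid point
        # Round up with probability `frac`, otherwise round down.
        idx = low + (1 if rng.random() < frac else 0)
        idx = min(idx, levels - 1)     # guard against floating-point overshoot
        out.append(lo + idx * step)
    return out


rng = random.Random(0)
grads = [-1.0, -0.3, 0.2, 1.0]  # hypothetical accumulated gradients
quantized = stochastic_quantize(grads, levels=3, rng=rng)
```

With three levels, every entry can be sent as a two-bit index plus the shared minimum and step, and averaging many quantized copies recovers the original values because the rounding is unbiased.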